new: Deploy and monitor ML models with GPUs on Amazon EKS #1020

shivkumr · 2024-07-27T05:18:29Z

What this PR does / why we need it:

This PR adds a new lab on deploying and monitoring ML models on Amazon EKS.

Which issue(s) this PR fixes:

First PR for this new lab

Fixes # NA

Quality checks

My content adheres to the style guidelines
I ran `make test module="<module>"` and it was successful (see https://github.com/aws-samples/eks-workshop-v2/blob/main/docs/automated_tests.md)

EKS Workshop
AI/ML on EKS
Deploy and Monitor GenAI Model on EKS
✔ Deploy and Monitor GenAI Model on EKS (1342913ms)
✔ Install Karpenter and KubeRay Operator (211251ms)
✔ Install Jupyterhub (444619ms)
✔ Model Training (60899ms)
✔ Model Inference (312600ms)
✔ Monitor GPU Workloads on EKS (4945ms)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

netlify · 2024-07-27T05:18:48Z

✅ Deploy Preview for eks-workshop ready!

Name	Link
🔨 Latest commit	`c729b67`
🔍 Latest deploy log	https://app.netlify.com/sites/eks-workshop/deploys/672bb70dae2af5000835519d
😎 Deploy Preview	https://deploy-preview-1020--eks-workshop.netlify.app

bkgardiner · 2024-09-04T21:10:38Z

Karpenter should be preinstalled in this lab, since installing it manually doesn't add much to the lab. Take a look at the Inference with AWS Inferentia lab (https://www.eksworkshop.com/docs/aiml/inferentia/), which comes with Karpenter preinstalled.

In the Jupyter Notebook commands section, it would be nice to get some explanation of what exactly this code is doing. This would also let the user read through the explanation while waiting for the code to execute.

Other than that this lab looks really good! Thank you for creating it.

bkgardiner

This looks good to me.

svennam92

Awesome! Some comments:

The prepare-environment block should link out to the Terraform, as in other modules.
Please explain the hardware infrastructure being used. Can we outline the Karpenter nodepools that are created for the user? Explain what g5 instances are and why we need them for the lab. Example: https://eksworkshop.com/docs/aiml/chatbot/nodepool
The titles for the AI/ML modules are not distinct enough. How about something like "Training StableDiffusion on NVIDIA GPUs"? Deployment and inference are implied when we're doing training.

shivkumr added 3 commits July 15, 2024 21:11

deploy-monitor-genai-model lab first commit

33bbed9

updated the syntax error in the kuberay install command

2e1e6f7

fixed all the module test errors

3b750b6

shivkumr requested a review from a team as a code owner July 27, 2024 05:18

shivkumr changed the title ~~New Lab AIML lab to deploy and monitor ML model on Amazon EKS~~ New AIML lab to deploy and monitor ML model on Amazon EKS Jul 27, 2024

bkgardiner self-assigned this Sep 4, 2024

Moved Karpenter installation into the preinstall script

96ca496

bkgardiner approved these changes Sep 12, 2024


niallthomson changed the title ~~New AIML lab to deploy and monitor ML model on Amazon EKS~~ new: Deploy and monitor ML models with GPUs on Amazon EKS Sep 26, 2024

niallthomson added 3 commits September 26, 2024 16:09

Merge branch 'main' into main

f818d9a

Update index.md

dca61f4

Require intro update for consistency, fixed spelling in file name

a69cba7

svennam92 requested changes Sep 27, 2024


niallthomson added this to the Release 10/25 milestone Sep 30, 2024

niallthomson added the content/aiml label Sep 30, 2024

niallthomson added 2 commits October 28, 2024 20:29

Various fixes to bring lab in line with rest of the workshop

b9995fe

Fix describe permissions

3021994

niallthomson removed this from the Release 10/25 milestone Nov 1, 2024

Merge branch 'main' into main

c729b67

shivkumr closed this by deleting the head repository Nov 26, 2024
